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Abstract 

Background: Genotype imputation is a cost efficient alternative to use of high density genotypes for implementing 
genomic selection. The objective of this study was to investigate variables affecting imputation accuracy from low 
density tagSNP (average distance between tagSNP from 1 0Okb to 1 Mb) sets in swine, selected using LD information, 
physical location, or accuracy for genotype imputation. We compared results of imputation accuracy based on several 
sets of low density tagSNP of varying densities and selected using three different methods. In addition, we assessed 
the effect of varying size and composition of the reference panel of haplotypes used for imputation. 

Results: TagSNP density of at least 1 tagSNP per 340kb (~ 7000 tagSNP) selected using pairwise LD information was 
necessary to achieve average imputation accuracy higher than 0.95. A commercial low density (9K) tagSNP set for 
swine was developed concurrent to this study and an average accuracy of imputation of 0.951 based on these tagSNP 
was estimated. Construction of a haplotype reference panel was most efficient when these haplotypes were obtained 
from randomly sampled individuals. Increasing the size of the original reference haplotype panel (128 haplotypes 
sampled from 32 sire/dam/offspring trios phased in a previous study) led to an overall increase in imputation accuracy 
(IA — 0.97 with 512 haplotypes), but was especially useful in increasing imputation accuracy ofSNP with MAF below 
0.1 and for SNP located in the chromosomal extremes (within 5% of chromosome end). 

Conclusion: The new commercially available 9K tagSNP set can be used to obtain imputed genotypes with high 
accuracy, even when imputation is based on a comparably small panel of reference haplotypes (1 28 haplotypes). 
Average imputation accuracy can be further increased by adding haplotypes to the reference panel. In addition, our 
results show that randomly sampling individuals to genotype for the construction of a reference haplotype panel is 
more cost efficient than specifically sampling older animals or trios with no observed loss in imputation accuracy. We 
expect that the use of imputed genotypes in swine breeding will yield highly accurate predictions of GEBV, based on 
the observed accuracy and reported results in dairy cattle, where genomic evaluation of some individuals is based on 
genotypes imputed with the same accuracy as our Yorkshire population. 
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Background 

Recent advances in genotyping technology have facilitated 
the availability of high density genotyping platforms in 
many livestock species. High density platforms including 
several thousand single nucleotide polymorphisms (SNP) 
are available for cattle [1-3], chicken [4], sheep [5], and pig 
[6]. 

These platforms can be used to increase the efficiency 
and accuracy of breeding programs by implementing 
genomic selection [7,8]. Using SNP data to inform breed- 
ing decisions allows animal breeders to select breeding 
stock prior to the animals having progeny of their own, 
thereby accelerating genetic progress through shortened 
generation intervals [7,8]. 

Currently, genomic selection has been successfully 
implemented in dairy cattle based on genotypes from the 
Illumina BovineSNP50 chip [7]. In an effort to increase 
cost efficiency, the use of low density (tagSNP) genotyping 
platforms was exploited for dairy cattle [9,10]. If high den- 
sity genotypes are imputed from tagSNP with high accu- 
racy, the loss of reliability of predicted genomic breeding 
values is minimal [9-11]. High accuracy of imputed geno- 
types depends on the selection of tagSNP, as well as the 
composition and size of the reference panel of haplotypes 
used for imputation. 

If close relatives of all imputation candidates are geno- 
typed at high density, untyped markers can be recovered 
through linkage and segregation analysis [12], where hap- 
lotypes can be traced through generations of directly 
related individuals using the rules of Mendelian inheri- 
tance. However, in some species it may not be feasible to 
genotype a large proportion of the pedigree at high den- 
sity. In that case a small panel of reference haplotypes 
can be used to impute all untyped markers by exploit- 
ing population-wide linkage disequilibrium (LD) [13,14]. 
This approach was initially proposed in human genome- 
wide association studies (GWAS) and has recently found 
application in plant [15] and animal breeding [10,11,16]. 
A combination of imputation based on segregation anal- 
ysis and population-wide LD is currently being used in 
dairy breeding [9]. While combining both approaches 
will increase accuracy of imputation, eventually becom- 
ing the default method, cost-effective implementation of 
genomic selection in novel populations is likely to ini- 
tially rely more on LD based imputation. Consequently, 
in this paper we will concentrate on LD based imputation 
by investigating tagSNP selection and haplotype reference 
panel construction. 

Human geneticists have proposed a variety of 
approaches to select an optimal low density set of tagSNP 
to achieve cost efficient imputation in GWAS [17]. These 
approaches include statistical criteria based on a pairwise 
threshold of LD between SNP (i.e. [18]) and predictive 
ability, selecting tagSNP that provide the most accurate 



prediction of all non-typed markers [19]. On the other 
hand, tagSNP sets used in livestock are mainly selected 
for equidistant spacing based on physical position along 
the genome, and high minor allele frequency (MAF) to 
ensure segregation (e.g. [1]). 

Crucial to successful implementation of genotype impu- 
tation using population wide LD is the availability of a rep- 
resentative panel of reference haplotypes [20,21]. These 
panels are commonly built by genotyping a small number 
of trios or a larger number of relatively unrelated individ- 
uals. The overall goal in either case is to collect genotypes 
that can be accurately phased [22] into haplotypes repre- 
sentative of population frequencies. As a result, we began 
our study by genotyping and phasing a small number of 
trios in four US pig breeds [23] (Nthos ~ 30) and further 
enriching this panel for the Yorkshire breed with a set of 
high density genotypes from largely unrelated individuals 

(^samples = 889). 

The objective of this study was to develop guide- 
lines for the implementation of genotype imputation in 
livestock populations having little or no prior use of 
genome-wide marker-assisted-selection. First, we com- 
pared imputation accuracy resulting from three methods 
of tagSNP selection using Yorkshire pigs genotyped with 
a high density SNP set (Illumina PorcineSNP60). This 
includes a report on imputation accuracy of the recently 
developed commercially available 9K tagSNP set referred 
to as the GeneSeek Genomic Profiler for Porcine LD 
(GGP-Porcine, GeneSeek a Neogen Company, Lincoln, 
NE). Second, we assess accuracy of imputation based on 
an increasing number of reference haplotypes to inform 
the selection of an optimal reference panel of haplo- 
types. Finally, we discuss imputation accuracy as a func- 
tion of chromosomal location and MAF of non-observed 
SNP. 

Methods 

Genotypes 

High density genotypes for approximately 30 
sire/dam/offspring trios were obtained and phased for 
each of four breeds of pigs (Duroc, Hampshire, Landrace, 
Yorkshire) in a previous study [23]. To ensure accurate 
phasing, the reference panel for imputation used in 
this study was the 128 haplotypes from the Yorkshire 
sire/dam pairs previously genotyped as parents in those 
trios. Animal protocols were approved by the Michigan 
State University All University Committee on Animal Use 
and Care (AUF# 03/09-046-00). The haplotypes of these 
animals are freely available at https://www.msu.edu/~ 
steibelj/JP_files/LD_estimate.html. 

Detailed information about data cleaning procedures, 
descriptive statistics of LD, and correlation of phase 
between Yorkshire and other US pig breeds can be found 
in Badke et al. [23]. In addition, DNA samples were 
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collected from 920 Yorkshire pigs and sent to a com- 
mercial laboratory (GeneSeek, a Neogen Company, Lin- 
coln, NE) to be genotyped on the Illumina PorcineSNP60 
(Number of markers M=62,163) Genotyping BeadChip 
(Illumina Inc.) [6]. Only animals with more than 90% 
genotype call rate were considered for analysis, result- 
ing in 889 animals used as the testing panel for this 
study. All SNP included in the 128 haplotype Yorkshire 
reference panel were used for analysis. All data from 
this study is available at https://www.msu.edu/~steibelj/ 
JP_files/imputation.html. 

In our previous study [23] we reported breed specific 
LD and persistence of phase among breeds for Duroc, 
Hampshire, Landrace, and Yorkshire pigs. We found that 
persistence of phase between Yorkshire and the other 
breeds ranged between 0.42 and 0.57 for SNP spaced 
approximately 1MB apart [23]. As a result the amount of 
LD within the Yorkshire breed that could be recovered 
through haplotypes from another breed ranges between 
0.18 and 0.33, such that adding haplotypes of a second 
breed to impute Yorkshire genotypes did not appear to be 
beneficial. For genomic selection, a simulation study con- 
ducted by de Roos et al. [24] found that persistence of 
phase between breeds needs to be much larger than the 
reported value between Yorkshire and any of the other 
three breeds to implicate any advantage for the use of 
mixed breed training panels. For this reason we decided 
to use only Yorkshire haplotypes in the reference panel for 
imputation in this paper. 

Genotype imputation and estimation of imputation 
accuracy 

All imputations in this study were done using BEAGLE 
version 3.3.1 [13], a genotype imputation software that 
uses a reference panel of haplotypes to estimate phase and 
impute missing genotypes in a set of unrelated individu- 
als. Beagle was run separately for each chromosome using 
128 reference haplotypes from the trio design [23] (phased 
file) to phase and impute genotypes of the 889 un-phased 
testing animals. All SNP, except tagSNP, were masked as 
missing in the testing set. Beagle was run for ten itera- 
tions of the phasing algorithm, drawing four samples per 
iteration. Previous results from another study [16], as well 
as a short experiment conducted in this study (data not 
shown) found no increase in imputation accuracy when 
the number of iterations or samples per iteration were 
increased. The output files from BEAGLE contained the 
most likely imputed genotypes (AA, AB, BB), posterior 
genotype probabilities (P(AA), P(AB), P(BB)), and poste- 
rior expected allelic dosage of the B allele derived from 
the posterior genotype probabilities (i.e. 0 * P(AA) + 1 * 
P(AB) + 2* P{BB)) [25]. 

Imputation accuracy was estimated using three differ- 
ent measures that reflect different influences of MAF 



and error counting. The proportion of correctly imputed 
alleles was computed as 



IA 



M Ni 

E E \gi) 

i=l ;'=1 



■gij\ 



M 

(=1 



(1) 



where gn is the observed allelic dosage of SNP i in indi- 
vidual gy is the corresponding posterior expected allelic 
dosage obtained from BEAGLE output, M is the total 
number of imputed SNP, and AT; is the number of individ- 
uals with called genotypes for SNP i. This overall measure 
of imputation accuracy can be further decomposed into 
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of imputation accuracy will be biased upwards, espe- 
cially for SNP with low MAF, because even if imputation 
ignores LD information and is based solely on allele fre- 
quency, the major allele would be correctly imputed for 
a large proportion of genotypes [15,16]. As tagSNP den- 
sity decreases imputation accuracy of rare alleles further 
decreases as rare haplotypes become harder to identify 
due to longer sequences of SNP missing [15]. Estimat- 
ing the total percentage of correctly imputed alleles for 
SNP with low MAF will be biased due to the large num- 
ber of correctly imputed major alleles masking the small 
number of misspecified minor alleles, which can be over- 
come through the use of a more sensitive measure of 
accuracy for these SNP [15]. In addition, if individuals 
carrying the minor allele are not correctly identified and 
their phenotype cannot be matched for GWAS this rela- 
tively small proportion of incorrectly imputed alleles will 
further decrease power. A variety of measures have been 
introduced to obtain estimates of imputation accuracy 
unbiased by MAF [15,16,26]. We estimated the propor- 
tion of correctly imputed alleles adjusted for MAF using 
the formula presented by Hayes et al. [16]: 

IA IAp re q 
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where IA is computed as described in equation (1) and 
IAf req is the accuracy of imputation based on genotypic 
frequencies estimated as: 

IAfreq = P(AA) ref * p(AA) va i + p(AB) ref *p(AB) vai 
+ p(BB) ref *p(BB) va i 

(3) 

where p(AA) re f it p(AE) re f i , and piBB),-^ are the observed 
frequencies for genotypes AA, AB, and BB for SNP i 
in the reference haplotypes and p(AA) va i v p(AB) va i i , and 
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p{BE) va i i are the predicted genotypic frequencies in the 
testing population for SNP i. IAp req can be interpreted as 
the expected probability of correctly imputing a genotype 
in the testing population by assigning a randomly sampled 
genotype from the haplotypes in the reference panel. This 
measure was computed on a SNP-wise basis and averaged 
across all SNP. To account for a slightly different number 
of genotypes observed within each SNP (due to missing 
at random) the average was obtained by weighting the 
accuracy of each SNP by the number of individuals with 
observed genotypes within each SNP. 

Alternatively, another measure of imputation accuracy 
robust to MAF is the squared correlation between the 
observed and imputed allelic dosage [15]. The correlation 
was obtained on a SNP by SNP basis using the correlation 
function in R [27]. SNP wise correlation measures were 
weighted by the number of available observations within 
the SNP to obtain an overall average imputation accuracy. 

Methods of tagSNP selection 

TagSNP were selected using three approaches: 1) evenly 
spaced based on physical position, 2) based on minimum 
pairwise LD with non-tagSNP (statistical selection), and 
3) based on marker predictive ability to accurately impute 
non-observed SNP genotypes (predictive selection). 

To select evenly spaced SNP the total length of each 
chromosome was partitioned into segments correspond- 
ing to the total number of tagSNP to be selected. Then, 
within each segment the SNP closest to the segment 
center was identified and added as a tagSNP. If a given seg- 
ment was empty, no tagSNP was selected in that segment. 

To implement a statistical search for tagSNP [19] we 
used the freely available software package FESTA [18]. 
FESTA performed a greedy search, where each SNP i was 
either an element of the tagSNP set or in LD higher than a 
threshold (rj ) with an existing element of the tagSNP set. 
FESTA was run repeatedly for increasing rf ranging from 
0.1 to 0.9 in 0.1 increments using estimates of LD based 
on 128 reference haplotypes. 

To implement predictive tagSNP selection, we applied 
the following forward search algorithm: First, we split the 
64 Yorkshire reference animals into a randomly sampled 
set of 10 individuals (training set) and 54 individuals (ref- 
erence haplotypes). Second, all SNP except one tagSNP 
in the training set were masked and imputed using the 
reference haplotypes. Third, accuracy of all imputed SNP 
was estimated and saved. Steps two and three were then 
repeated until all estimates of imputation accuracy were 
available for all potential tagSNP (at first, the potential 
tagSNP are all SNP on the chromosome). Fourth, the SNP 
that yielded the highest average accuracy of imputation 
among those not already chosen as tagSNP was selected as 
a new tagSNP. Steps two through four were repeated until 
the maximum number of tagSNP or a target imputation 



accuracy were reached. Because of the high computational 
demand of this methodology, this approach was only 
applied to the smallest available chromosome (SSC18), 
selecting tagSNP from 786 candidate SNP. 

Concurrent to this research, a set of 9390 tagSNP 
(Release Date: April 2012) was assembled by GeneSeek 
(Lincoln, NE) for the development of a commercial plat- 
form for low density genotyping in swine. This assay 
has been marketed as the GeneSeek Genomic Profiler 
for Porcine LD (GGP-Porcine; GeneSeek, Lincoln, NE). 
After production, the GGP-Porcine contains approxi- 
mately 8500 tagSNP (Jeremy Walker, personal communi- 
cation). TagSNP covering the entire genome were selected 
based on MAF in 13 commercial lines of pigs repre- 
sented by four breeding companies and four purebred 
populations. The MAF were provided by the breeding 
companies (identified simply as company A, B, C, and 
D). The number of lines provided by these companies 
were 1, 1, 4, and 7. Additional estimates of MAF used to 
identify tagSNP were obtained from our previous study 
[23] of four pure breeds: Duroc, Hampshire, Landrace, 
and Yorkshire. The freely available SNPspace software 
(CP. Van Tassell, unpublished data) was used to select 
tagSNP. SNPspace was initially developed to select SNP 
for the Illumina BovineSNP50 beadchip [2]. The con- 
ceptual framework of SNPspace is briefly described in 
that study [2], but additional features have been added 
since that time. Relative weights on lines or breeds of 
pigs ranged from 0.00625 to 0.25. SNPspace is based on 
a greedy algorithm, where SNP scores account for breed 
or line specific MAF, region of the genome, and position 
of SNP relative to previously selected tagSNP. Density of 
tagSNP was doubled within 5 Mbp of the chromosomal 
extremes, which has been shown to improve average accu- 
racy of imputation compared to tagSNP evenly spaced 
across the entire chromosome [1,28]. 

Increasing reference panel size 

To assess the effect of the number of reference haplotypes 
on imputation accuracy, we split the available sample of 
889 Yorkshire pigs into two groups: 1) a 200 animal testing 
panel, and 2) a 689 animal set of supplemental reference 
sires. Assignment to the two panels was random. 

To obtain imputation accuracy for a decreased set of 
reference haplotypes, we split the original 128 reference 
haplotypes obtained from 64 Yorkshire animals into two 
groups of 64 reference haplotypes (corresponding to 32 
animals) and estimated average imputation accuracy in 
the 200 animal testing set. Then, we split the two groups 
of 64 reference haplotypes further into two groups of 32 
reference haplotypes and obtained four estimates of impu- 
tation accuracy that were averaged into a single measure. 

Subsequently, we compared imputation accuracy using 
trio based reference panels to imputation accuracy based 
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on randomly sampled reference panels. To this end, we 
randomly sampled 16 animals from the 689 animal sup- 
plemental reference set and continued to add individuals 
at random to obtain reference sets of 24, 32, 48, 64, 96, 128, 
256, and 512 animals. Each of these sets was phased indi- 
vidually using BEAGLE [13] and then those haplotypes 
were used as reference panel to impute the 200 testing 
animals. 

Finally, we assembled reference panels of haplotypes 
combining the original 128 haplotypes from trios, with 
an increasing number of supplemental reference sires. To 
form these reference panels 64, 128, 192, and 448 supple- 
mental reference sires were randomly selected and phased 
using the trio haplotypes as a reference panel. Both, the 
trio reference haplotypes and an increasing number of 
supplementary reference haplotypes were then used to 
impute the 200 animal testing set. 

Because imputation accuracy was constant across chro- 
mosomes (see Results, section 3.1) we conducted this 
experiment on chromosome SSC14, a medium sized chro- 
mosome that has uniform coverage of SNP across its 
length. We expect results to extrapolate to all other chro- 
mosomes. 

Results 

Comparison of methods for tagSNP selection 

Due to the high computational demand, we initially per- 
formed a comparison of methods for tagSNP selection 
only on the smallest chromosome (SSC18). Statistical 
tagSNP selection requires fixing an r 2 threshold (rj ). Set- 
ting rj = 0.2 resulted in the selection of 165 tagSNP, 
which produced imputation accuracy of 0.936. Increasing 
rj to 0.3, led to a panel of 235 tagSNP and an increased 
imputation accuracy of 0.956. In comparison, imputa- 
tion accuracy based on 165 and 235 tagSNP selected for 
predictive ability was 0.93 and 0.945, respectively. Direct 
comparison to tagSNP sets selected for even spacing is 
more difficult because of empty intervals, for which no 
tagSNP were selected, resulting in smaller than targeted 
tagSNP sets. The evenly spaced tagSNP sets closest in 
size to 165 and 235 tagSNP were as expected slightly 
smaller (161 and 224 tagSNP), and the resulting imputa- 
tion accuracies were slightly lower than those obtained 
using the other sets (0.92 and 0.941, respectively). As 
expected, imputation accuracy increased with increasing 
densities of tagSNP regardless of the selection method 
(Figure 1). Statistically selected tagSNP performed slightly 
better than both, predictive and evenly spaced tagSNP 
(Figure 1), but all three methods resulted in similar impu- 
tation accuracy. Selection of tagSNP using predictive abil- 
ity required an at least 500-fold increase in computation 
time for SSC18 compared to statistical and evenly spaced 
selection. However, results of imputation accuracy indi- 
cate that predictive tagSNP did not yield significantly 



higher imputation accuracy compared to tagSNP selected 
by other methods. Therefore, only statistical and evenly 
spaced tagSNP were selected in an exhaustive evaluation 
of imputation accuracy across all autosomes (Figure 2). 

When imputing across all autosomes, as observed on 
SSC18, imputation accuracy using statistically selected 
tagSNP was slightly higher than that using evenly spaced 
tagSNP (Figure 2). In particular, to attain imputation accu- 
racy of 0.95, 7036 statistically selected tagSNP were nec- 
essary (rj = 0.3). In comparison, 10540 evenly spaced 
tagSNP were necessary to reach similar imputation accu- 
racy. Imputation accuracy was virtually uniform across 
chromosomes ranging from 0.92 to 0.94 for rj = 0.2 and 
from 0.94 to 0.96 for rj = 0.3. 

We computed imputation accuracy based on 7323 
tagSNP from the original list of 9K tagSNP provided 
by GeneSeek that passed quality control in this study 
(MAF > 0.05, CallRate > 0.9, assembled to an autosome 
under map buildlO) resulting in imputation accuracy of 
0.951 with a SNP-wise 95% highest posterior density inter- 
val equal to [0.84, 1] (Figure 2). Accuracy of imputation 
using the commercial tagSNP was similar to that obtained 
using statistically selected SNP, at comparable density 
(rj = 0.3, M tag sNP = 7036). The advantage of the pro- 
posed commercial platform is that it is not based on pop- 
ulation specific LD, thereby making it applicable across 
swine populations. For this reason all subsequent results 
of imputation accuracy will be based on the tagSNP ele- 
ment of the Genomic Profiler for Porcine LD. 

Imputation accuracy using the commercial 9K tagSNP set 

To assess accuracy of imputation as a function of chro- 
mosomal location we plotted imputation accuracy of each 
individual SNP versus chromosomal position (Figure 3). 
SNP within 5% of the chromosomal extremes had on aver- 
age slightly lower imputation accuracy (0.949) than the 
10% in the center of the chromosome (0.972). As men- 
tioned before, this property of imputation accuracy has 
previously been observed for other low density sets [1,28] 
and was anticipated during the tagSNP set design. Based 
on these reports [1,28], the density of tagSNP was approx- 
imately doubled within 5 Mbp of the chromosomal ends 
in the commercial 9K tagSNP set. 

Animal-wise imputation accuracy (I A,.) averaged 0.951 
but the corresponding highest posterior density interval 
([0.917,0.978]) was shorter than that observed for SNP- 
wise accuracy. Overall, all but 12 animals had imputation 
accuracy > 0.90 and 551 animals (62%) had imputation 
accuracy above 0.95. Also, seven of the animals had a 
dam, sire, or grand-sire in the reference panel [23], which 
resulted in on average higher accuracy of imputation in 
these animals (0.959). A group of 15 animals was identi- 
fied with consistently low imputation accuracy (i.e. < 0.91 
for 9K tagSNP) across all sets of tagSNP selected. An 
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Figure 3 SNP-wise imputation accuracy by chromosomal location. SNP-wise imputation accuracy (IA,) vs. the scaled chromosomal location of 
the SNP. The red line is the weighted mean average estimated using a loess smoother [29], and the green line represents average imputation 
accuracy {IA = 0.951). 



ongoing research project in our laboratory investigating 
breed composition, identified all of the 15 low accuracy 
individuals as potentially having mixed breed ancestry 
(Yijian Huang, unpublished data). Further assessing the 
pedigree of these 15 animals, we found that nine of them 
were imported to the US, which could result in a slightly 
different haplotype composition and the observed low 
accuracy of imputation, when only American Yorkshire 
pigs had been used as reference. Another three animals 
of the remaining six US Yorkshires with low imputation 
accuracy were identified as a family (sire, two offspring), 
such that the observed low accuracy in the offspring is 
likely a result of the mixed breed ancestry of their sire. 

As noted before, to assess the effect of MAF of imputed 
SNP on imputation accuracy required adjusting estimates 
of imputation accuracy for MAF. Imputation accuracy as 
a function of MAF is presented in Figure 4, where impu- 
tation accuracy was estimated as a) proportion of alleles 
correctly imputed, b) coefficient of determination (R 2 ) 
between observed and imputed allelic dosage [15,26], and 
c) proportion of alleles correctly imputed adjusted for 
MAF [16]. The red line in all plots represents the weighted 
mean average estimated using a loess smoother [29]. Loess 
consists of fitting smooth piecewise polynomial regres- 
sions to local subsets of data and it is widely used in 
normalization of micro-array experiments [30]. At first 



inspection, it can be seen that accuracy estimated as the 
proportion of correctly imputed alleles (Figure 4a) is high- 
est for low frequency alleles and exhibits a small decrease 
as MAF increases. However, the observed high proportion 
of correctly imputed alleles in SNP with low MAF is based 
on the fact that high frequency alleles can be imputed 
with high accuracy even if imputation is solely based on 
allele frequency [15,16]. For this reason, we computed R 2 
and the proportion of correctly imputed alleles adjusted 
for MAF that provide estimates of imputation accuracy 
unbiased by allele frequency. In other words, these mea- 
sures are indicative of the performance of the imputation 
algorithm in comparison to a baseline imputation based 
on genotypic frequencies [15,16]. When imputation accu- 
racy is adjusted for MAF, estimated accuracy is generally 
higher for intermediate allele frequencies (MAF ~ 0.5) 
and declines as MAF decreases (Figure 4 b/c). Average 
imputation accuracy considering only the added benefit of 
the imputation algorithm was lower (IAmaf = 0.91, R 2 = 
0.81) than the total proportion of correctly imputed alle- 
les (IA = 0.951). The difference between the proportion 
of correctly imputed alleles adjusted for MAF (Figure 4c) 
and estimates of R 2 (Figure 4b) can be explained by the dif- 
ference in error counting between these measures. While 
the proportion of correctly imputed alleles adjusted for 
MAF is obtained by counting the total number of wrongly 
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Figure 4 The measures of SNP-wise imputation accuracy by MAF. SNP-wise imputation accuracy computed as a) the proportion of correctly 
imputed alleles (//V), b) the correlation between imputed and observed allelic dosage (ft 2 ), and c) the proportion of correctly imputed alleles 
adjusted for MAF (IAmaf,), as a function of MAF of the SNP. The red line is the weighted mean average estimated using a loess smoother. 



imputed alleles, R 2 is obtained from the squared dif- 
ference in imputed and observed alleles, thereby more 
heavily penalizing large differences between observed and 
imputed allelic dosage. 

Effect of numbers of reference haplotypes on imputation 
accuracy 

For all previous analyses in this paper we imputed geno- 
types of 889 individuals across all autosomes using a ref- 
erence panel of 128 Yorkshire haplotypes obtained from 
a sire/dam/offspring genotyping design [23], phased with 
higher accuracy [22]. Reducing the number of imputation 
animals from 889 to 200 had no impact on the observed 
imputation accuracy. Imputation accuracy using all 128 
haplotypes from the original reference panel was 0.959 on 
SSC14, which reduced to 0.939 when 64 haplotypes were 
used, and further to 0.904 when imputation was based on 
32 haplotypes (Figure 5). Therefore, imputation accuracy 
larger than 0.90 can be obtained using the commercial 9K 
tagSNP set with a reference panel of only 32 haplotypes, 
given that these haplotypes were phased at high accuracy. 

We further investigated if it is necessary to obtain ref- 
erence haplotypes from a trio design, or if accuracy can 
be replicated using a reference panel of randomly sam- 
pled individuals genotyped at high density. In comparison 
to imputation accuracy obtained using a trio reference 
panel, imputation accuracy based on 32 and 64 reference 
haplotypes derived from selected sires was slightly lower 
(0.875 and 0.926, respectively). However, accuracy from 
128 reference haplotypes obtained from 64 randomly sam- 
pled individuals was 0.955, which is practically identical 
to results obtained using 128 reference haplotypes from 
trios. Therefore, if the reference panel of haplotypes is 
composed of more than 128 haplotypes, there is no longer 
an advantage in using haplotypes obtained from a trio 
design. Alternatively, the cost of assembling panels of 32, 



64, and 128 reference haplotypes obtained from a trio 
design involves the same genotyping cost as assembling 
panels of 48, 96, and 192 haplotypes obtained from ran- 
domly sampled individuals. This is due to the fact that 
in a trio design the offspring haplotypes are not used as 
part of the reference panel, since they are identical to 
the parents transmitted haplotypes. Imputation accura- 
cies for 48, 96, and 192 reference haplotype panels from 
randomly sampled individuals were estimated to be 0.906, 
0.947, and 0.965, respectively, which is either equivalent 
or higher than accuracy of imputation obtained using the 
cost equivalent trio based reference panels (Figure 5). In 
addition, we compared imputation accuracy from refer- 
ence haplotypes of either 64 randomly selected individuals 
or the 64 oldest individuals and found no difference in 
imputation accuracy (0.956, 0.953 respectively). Conse- 
quently, if no reference panel of haplotypes is available 
for a population, according to the results of this study, 
it would be most cost efficient to assemble high density 
haplotypes of randomly sampled individuals. 

Previous research has indicated an increase in impu- 
tation accuracy can be expected as the number of avail- 
able reference haplotypes increases to a certain point 
[20,21]. We added randomly selected individuals to the 
reference panel and obtained 256, 512, and 1024 refer- 
ence haplotypes. These panels resulted in average accu- 
racy of imputation of 0.971, 0.978, and 0.985 respectively 
(Figure 5). Imputation accuracy only marginally increased 
when more than 256 haplotypes were used as reference 
panel for imputation (up to 1.4% gain). Additionally, we 
assessed accuracy of imputation from reference panels 
composed of the original 128 reference haplotypes from 
a trio design and an increasing number of randomly sam- 
pled individuals added to that panel. In this case, impu- 
tation accuracy based on reference panels with 256 and 
512 haplotypes was 0.969 and 0.978 respectively, which is 
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Figure 5 Effect of number of reference haplotypes on imputation accuracy. Average imputation accuracy (IA) as a function of the number of 
haplotypes in the reference panel used for imputation. Imputation accuracy was estimated for reference panels composed of haplotypes from a trio 
design (blue triangle), reference panels composed of haplotypes from randomly sampled sires (red circle), and reference panels composed of both 
haplotypes from a trio design and haplotypes from randomly sampled sires (black circle). 



virtually identical to results obtained using reference pan- 
els solely from randomly sampled individuals (Figure 5). 

In addition to assessing the effect of an increased num- 
ber of reference haplotypes on average accuracy, we also 
investigated how it affects individual SNP with differ- 
ent MAF and physical location. We found that as the 
size of the reference panel increases, imputation accuracy 
(quantified as R 1 ) improved more markedly for SNP with 
MAF below 0.1, such that when the size of the reference 
panel is increased from 256 haplotypes to 512 haplotypes 
the increase in accuracy for SNP with MAF below 0.1 
was on average 0.06 points, while for all other SNP the 
increase was only 0.02 points (Additional file 1: Figure 
SI). When imputation was based on 1024 reference hap- 
lotypes imputation accuracy appears to be uniform across 
allele frequencies. Similarly, we observed that imputa- 
tion accuracy (proportion of correctly imputed alleles) 
for SNP located in the 10% chromosomal extremes (5% 
on either side) could be improved through an increase 
in the number of reference haplotypes (Additional file 2: 
Figure S2). A reference panel containing 512 haplotypes 
was necessary to obtain maximal imputation accuracy 
(IA = 0.99) for SNP located in the chromosomal center, 
while SNP in the chromosomal extremes were imputed 
with accuracy of only 0.97, even when the number of 



reference haplotypes was doubled (1024 reference haplo- 
types). Imputation accuracy observed in SNP located in 
the chromosomal extremes was more than 0.02 accuracy 
units lower than the average imputation accuracy of all 
remaining SNP irrespective of the reference panel size. 

Discussion 

Methods for tagSNP selection 

Current algorithms for genotype imputation exploit 
population-wise LD [13,14], familial LD from identity by 
descent [31], or a combination of both [32] to infer unob- 
served genotypes conditional on tagSNP information. Vir- 
tually all methods for tagSNP selection aim at identifying 
tagSNP that carry the maximum amount of information 
to impute unobserved markers. This is attained by either 
directly quantifying the tagSNP ability to predict non- 
typed SNP (predictive tagSNP selection) or indirectly by 
selecting tagSNP in high pairwise LD with non-tagSNP 
(statistical tagSNP selection) [17]. 

A goal of this study was to select a minimal set of 
tagSNP that would yield acceptable accuracy of imputa- 
tion of non-tagSNP [10,28]. Since genotype imputation 
utilizes information about the structure of LD to infer 
non-observed SNP, we expected that tagSNP sets selected 
based on LD information, such as statistical and predictive 
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tagSNP selection, would yield higher accuracy of imputa- 
tion than tagSNP selected based solely on their physical 
location. In addition, we expected that directly assess- 
ing the ability of each tagSNP to predict non-observed 
SNP (predictive selection) would yield an improvement in 
imputation accuracy compared to tagSNP selected purely 
based on pairwise thresholds of LD (statistical selection). 

We found that at the lowest examined tagSNP density 
(1 tagSNP per Mb) accuracy of imputation was below 0.87 
irrespective of the method of tagSNP selection and that at 
least 2 tagSNP per Mb were necessary to increase accu- 
racy to at least 0.91. Accuracy of imputation increased as 
tagSNP density increased, reaching a plateau accuracy of 
approximately 0.98 when tagSNP were spaced at an aver- 
age distance of less than 125kb with negligible increases 
beyond such density. Our results compare well to those 
of Weigel et al. [33], where randomly selected tagSNP at 
an approximate density of 300kb were necessary to obtain 
accuracy larger than 0.90 in the US Jersey cattle popu- 
lation using a similar type of imputation. In our study, 
imputation accuracy of approximately 0.95 was obtained 
using between 7000 (average tagSNP spacing of 340kb) 
and 10000 tagSNP (average tagSNP spacing of 230kb), 
depending on the method of tagSNP selection. 

As expected, predictively and statistically selected 
tagSNP did yield higher accuracy of imputation than 
evenly spaced tagSNP, but we found no difference in 
imputation accuracy between tagSNP sets selected sta- 
tistically or based on predictive ability. Comparing 300 
tagSNP selected using predictive ability to 317 tagSNP 
obtained using statistical selection (r 2 = 0.4) on SSC18, 
we observed the same imputation accuracy (I A = 0.95). 
However, the two sets are qualitatively different. For 
instance, the 300 predictive tagSNP only provide statisti- 
cal coverage (r 2 < 0.4) to 37% of non-tagSNP The tagSNP 
sets also have on average different MAF (MAF pre dictive = 
0.30, MAF statistica i = 0.27). We attribute the equivalence 
in imputation accuracy of two different tagSNP sets to 
the extent of LD observed across the genome in Yorkshire 
pigs (r 2 = 0.16 at 1 Mb) [23]. Under these conditions, 
precision of estimates of individual tagSNP imputation 
accuracy is likely compromised by collinearity, making 
selection of a single best predictive tagSNP at each step 
of the forward search complicated [34]. For example, the 
initial step of the forward search for predictive tagSNP 
resulted in six SNP with predictive ability within 0.002 
accuracy units of each other. Each of these SNP could 
have been selected as a starting point of the greedy search, 
resulting in different sets of tagSNP selected. Further- 
more, the implemented predictive forward search requires 
MiiMi+D-MtiiMu+i) imputation operations per iteration 

step compared to only two with statistical selection, where 
Mi is the number of SNP per chromosome and M t i is 
the number of selected tagSNP on that chromosome. 



Consequently, even though both methods result in differ- 
ent tagSNP sets, statistical selection is a computationally 
efficient proxy for predictive tagSNP selection when mod- 
erate LD between consecutive markers is present. 

We show that tagSNP sets strictly selected for even spac- 
ing are slightly outperformed by statistical or predictive 
tagSNP selection. However, it is possible to enhance the 
performance of evenly spaced tagSNP through a few sim- 
ple measures. TagSNP with high MAF seem to be advan- 
tageous for genotype imputation (predictive tagSNP selec- 
tion seemed to favor tagSNP with high MAF) and their 
likelihood to segregate across populations will ensure that 
they carry information for imputation in various popu- 
lations. This has been exploited previously in cattle for 
the assembly of the 3K platform [28], as well as in newer 
tagSNP sets aimed to further increase imputation accu- 
racy [1]. In addition to selecting evenly spaced tagSNP 
with high MAF, an increase in accuracy can be obtained by 
increasing tagSNP density in the chromosomal extremes 
[1]. The success of these enhancements of evenly spaced 
tagSNP is evident in the imputation accuracy we report 
using the commercial 9K set {M ta gSNP = 7323, IA = 
0.951), which is similar to results we found for statistical 
tagSNP sets for thresholds r f 2 = 0.3 {M tag sNP = 7036, 
IA = 0.952). In addition, although the recently released 
commercially available chip has approximately 10% fewer 
tagSNP than the original 9K tagSNP list that was used for 
this analysis, our conclusions regarding imputation accu- 
racy are likely to uphold, due to the fact that we based our 
analysis on a set of only 7323 tagSNP, which should be rep- 
resentative of the number of commercial tagSNP that will 
pass quality control in future study samples. 

In summary, efficient tagSNP selection based on MAF 
and physical location is feasible and more flexible than sta- 
tistical tagSNP selection. Selecting evenly spaced tagSNP 
with high MAF requires knowledge of the physical loca- 
tion of the SNP and the MAF across populations of inter- 
est, while statistical tagSNP selection requires knowledge 
of the LD structure, and would be population specific. As 
a result, selecting a tagSNP set with high MAF and an 
increased density in the chromosomal extremes is more 
versatile than tagSNP sets selected for predictive ability or 
based on statistical criteria while yielding the same accu- 
racy of imputation. In addition, the tagSNP set selected 
based on physical location and MAF is expected to be use- 
ful for imputation as long as the 60K chip is being used 
for genomic selection, because we do not expect selection 
to alter LD or MAF of selected SNP in any particular way. 
If such tagSNP sets will be used across multiple closely 
related populations it will be necessary to include a num- 
ber of SNP that will be specific to a subset of populations. 
In the case of the 9K tagSNP set more than 9000 tagSNP 
were selected based on MAF across several populations 
and physical location of the SNP, but only 7323 of these 
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SNP passed quality editing for the Yorkshire data in this 
study. 

Factors affecting imputation accuracy 

Accuracy of imputation is affected by several factors 
including the selection and density of tagSNP as detailed 
above, the MAF and the physical location of the imputed 
SNP, as well as the size and composition of the reference 
panel. 

When evaluating imputation accuracy as a function of 
the tagSNP selection method and density we have focused 
on average accuracy as a measure of overall performance. 
Assessing the average accuracy of imputation is a good 
indicator of the performance of imputed genotypes, when 
all genotypes are used simultaneously to obtain a global 
measure. Such a measure could be prediction of GEBV, 
which would be based on all SNP simultaneously, such 
that a small number of wrongly imputed SNP is unlikely 
to greatly affect the accuracy of prediction. Alternatively, 
some applications of imputed high density genotypes may 
require high accuracy across all SNP. One example would 
be GWAS based on imputed genotypes. For GWAS, SNP 
associations are assessed on a SNP by SNP basis, such that 
wrongly imputed alleles for low frequency SNP are more 
likely to cause bias in the estimated association, especially 
since phenotypes of interest are suspected to be associated 
with low frequency alleles [20]. 

One of the factors directly related to the individual SNP 
imputation accuracy, is the allele frequency of that partic- 
ular SNP. To investigate imputation accuracy as a function 
of MAF we used two measures of imputation accuracy 
that were unbiased by MAF (i.e. IAmaf, R 2 )- The adjusted 
proportion of correctly imputed alleles (IAmaf) and the 
correlation between observed and imputed allelic dosage 
(R 2 ) are scaled differently, such that the observed accuracy 
differs as a function of scale, but the comparative differ- 
ence in imputation accuracy as a function of MAF can be 
observed using either of the two accuracy measures. We 
found that estimates of imputation accuracy adjusted for 
MAF {R 2 , IAmaf) are lower {R 2 = 0.73, IAmaf = 0.89) 
for SNP with MAF below 0.1, compared to SNP with MAF 
above 0.1 (R 2 = 0.82, IAmaf = 0.91), which has been 
previously noted by Hayes et al. [16] reporting results of 
genotype imputation in sheep and Hickey et al. [15] in 
lines of maize. 

Another factor relating to individual SNP imputation 
accuracy is the physical location of the SNP. Previous 
studies designing low density genotyping platforms have 
pointed out the need to increase coverage of tagSNP 
in the chromosomal extremes due to difficulties in cor- 
rectly imputing SNP located in those regions [1,28]. In 
the commercial 9K tagSNP set the density of tagSNP 
within 5Mbp of the chromosomal extremes was approxi- 
mately doubled to aid imputation accuracy. We found that 



imputation accuracy using the 9K commercial tagSNP 
was still slightly lower in the extreme regions (0.949) 
when compared to the chromosome center (0.972). The 
effect however was alleviated in comparison to an equally 
spaced tagSNP set of comparable density, where the aver- 
age imputation accuracy in the chromosomal extremes 
was only 0.89. 

We found a group of 15 animals that produced consis- 
tently low imputation accuracy (IA < 0.90), compared to 
all remaining animals {IA = 0.951). Nine of these ani- 
mals were identified as imports, such that the observed 
low accuracy of imputation is likely a result of differences 
in haplotype frequencies between the US Yorkshire popu- 
lation that was used as reference for imputation, and the 
population(s) from which these animals originated. The 
remaining six animals were all identified as having poten- 
tially mixed breed ancestry based on results of a con- 
current research project in our laboratory (Yijian Huang, 
unpublished data). In addition, we can infer from these 
results that if a population contains heterogeneous sub- 
populations, such as a large number of imported animals 
or animals with cross-bred ancestry, imputation accuracy 
will be decreased if this sub-structure is not accounted for 
when sampling reference haplotypes. 

We found that increasing the number of reference hap- 
lotypes led to an increase in average imputation accu- 
racy. In addition to the number of reference haplotypes, 
their average relatedness to the imputation candidates 
[15,16,35,36], as well as accurate phasing of these haplo- 
types [37] directly affect the resulting accuracy of impu- 
tation. In this paper, we assessed the effect of phas- 
ing accuracy and the number of reference haplotypes. 
Previous research comparing phasing accuracy of unre- 
lated or randomly sampled individuals and trio designs 
(sire/dam/offspring), found that genotypes from trios can 
be phased with higher accuracy [22]. The initial refer- 
ence panel available in this study was composed of the 
haplotypes of sire/dam pairs from a previous sample of 
trios that were unrelated for at least two generations 
and therefore sampled to efficiently cover the Yorkshire 
population [23]. We found that for haplotype panels com- 
posed of 64 or less haplotypes imputation accuracy was 
higher when these haplotypes were obtained from the 
trio design rather than a random sample of individuals 
(Figure 5). This advantage of the trio design is likely due 
to the superior phasing accuracy as well as the sampling 
strategy used to obtain these samples. However, adjust- 
ing sample size for the increased genotyping cost in a 
trio design, we observed that imputation accuracy was 
equal or higher when imputation was based on haplotypes 
obtained from randomly sampled individuals instead of 
trio reference haplotypes (Figure 5). Therefore, we con- 
clude that if no reference panel is available in a popula- 
tion the most cost efficient method for reference panel 
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construction is genotyping a random sample of individu- 
als across the population. 

Next we assessed imputation accuracy as a function of 
increasing reference panel size. We found that a reference 
panel of 256 to 512 reference haplotypes is sufficient to 
obtain imputation accuracy of IA = 0.97. If the size of 
the reference panel is increased beyond 1024 haplotypes 
(IA = 0.985) any further gain in imputation accuracy 
appears to be very small. A similar type of response has 
been observed in human genotype imputation [21]. This 
relatively small number of reference haplotypes neces- 
sary to obtain high imputation accuracy (IA = 0.97) 
is likely due to the relatively small effective population 
size of the Yorkshire population (N e = 113, [38]), and 
consequently high average LD even at decreased tagSNP 
density. 

After determining that increasing the size of the ref- 
erence panel would increase accuracy of imputation, we 
assessed whether accuracy would differ as a function of 
reference panel composition. In general, when imputation 
experiments are conducted the older animals are used as 
reference panel, while the younger animals serve as impu- 
tation candidates [33,39]. We assessed whether imputa- 
tion accuracy would differ depending on the reference 
panel being composed of randomly selected individuals or 
older individuals and found no advantage in imputation 
accuracy when selecting a reference panel composed of 
older animals. 

In addition to the observed increase in overall impu- 
tation accuracy, we found that increasing the size of 
the reference panel is especially efficient at increasing 
the individual imputation accuracy of SNP that exhib- 
ited below average imputation accuracy [20]. SNP with 
MAF below 0.1 were imputed poorly in comparison to 
SNP with MAF above 0.1 (accuracy measure R 2 , IAmaf), 
but as reference panel size increased imputation accu- 
racy of these SNP improved, and for imputation based 
on 1024 haplotypes we observed a uniform distribution 
of imputation accuracy (quantified as R 2 ) across levels of 
MAF (Additional file 1: Figure SI). An increase in the 
size of the reference panel increases the precision of esti- 
mated frequencies of haplotypes containing rare alleles, 
which appears to more efficiently boost imputation accu- 
racy for the corresponding SNP [20]. SNP located in the 
10% chromosomal extremes (5% on either side) also had 
on average lower imputation accuracy than the remain- 
ing SNP. As reference panel size was increased very little 
improvement could be observed in imputation accuracy 
of SNP located in the center of the chromosome, due to 
these SNP already being imputed with accuracy close to 
1. However, imputation accuracy of SNP in the chromo- 
some ends improved as reference haplotypes were added 
to the panel, until reaching accuracy within 0.02 points of 
the average imputation accuracy of SNP in the remainder 



of the chromosome for imputation based on a reference 
panel containing 1024 haplotypes (Additional file 2: Figure 
S2). 

Although, we did not assess the accuracy of GEBV pre- 
diction based on imputed genotypes in this paper, we 
can use results from dairy cattle breeding that show the 
promise of imputed genotypes to predict GEBV. Based on 
the average imputation accuracy we observed for York- 
shire pigs and previous results for GEBV prediction based 
on imputed genotypes in dairy cattle we could expect 
that losses in accuracy of GEBV prediction as a result 
of genotype imputation will be negligible. Wiggans et al. 
[40] and Dassonneville et al. [9] reported correlation of 
GEBV from imputed genotypes (IA > 0.96) with GEBV 
estimated from high density genotypes larger than 0.93. 
Moreover, Weigel et al. [10], reported a loss in accuracy 
of GEBV, estimated as the correlation between GEBV and 
direct genomic value, between 0 and 5% when using geno- 
types imputed with low accuracy (IA = 0.91). Since our 
estimates of imputation accuracy in the Yorkshire popula- 
tion are within the range of those reported in dairy cattle 
[9,10,33,40], we expect GEBV estimated from imputed 
genotypes in Yorkshire pigs to be as accurate as those cur- 
rently used in the diary breeding industry. Furthermore, 
these results are expected to hold in other swine breeds 
with similar levels of LD [23]. 

Conclusion 

In conclusion, high (IA > 0.95) genotype imputation 
accuracy can be achieved in pigs combining the newly 
available commercial 9K tagSNP set and a relatively small 
reference haplotype panel (128 haplotypes), even when 
imputation is based only on population-wide LD. Further 
improvements in imputation accuracy could be achieved 
through the inclusion of additional reference animals 
(IA = 0.97 with 512 reference haplotypes) and the use 
of pedigree relations between reference and imputation 
animals in the imputation algorithm [35,36,40]. An impor- 
tant result from this study is that an efficient design for 
reference panel construction is randomly sampling indi- 
viduals instead of specifically sampling older animals or 
trios. In addition, a relatively small panel of reference hap- 
lotypes (> 128) can efficiently serve as a reference panel 
for genotype imputation, such that any available high 
density genotypes in a livestock population could poten- 
tially serve this purpose. For the pig species such panels 
are already available for several populations [23]. Finally, 
prospects for the use of imputed genotypes in GEBV pre- 
diction are very positive based on the results from dairy 
breeding that routinely use similarly accurately imputed 
genotypes for genomic evaluation [9,10,40]. The method- 
ology used in this paper for construction of tagSNP sets 
and reference haplotype panels can be easily applied in 
any future study population. Code and data to obtain and 
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reproduce the results presented is publicly available at 
https://www.msu.edu/~steibelj/JP_nles/imputation.html. 



Additional files 

Additional file 1: Figure SI. Effect of reference panel size on 
imputation accuracy of SNP as a function of their MAF. Weighted 
mean average imputation accuracy (quantified as R 2 ) as a function of MAF 
depicted for imputation based on haplotype reference panels of increasing 
size. 

Additional file 2: Figure S2. Effect of reference panel size on 
imputation accuracy of SNP as a function of scaled physical location. 
Weighted mean average imputation accuracy (quantified as IA) as a 
function of the scaled chromosomal location for imputation based on 
haplotype reference panels of increasing size. 
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